
DRILL-8474: Add Daffodil Format Plugin #2836

Closed
wants to merge 8 commits

Conversation

mbeckerle
Contributor

@mbeckerle mbeckerle commented Oct 14, 2023

Adding Daffodil to Drill as a 'contrib'

Requires Daffodil 3.7.0-SNAPSHOT, which has the metadata support we're using.

New format-daffodil module created

Still uses absolute paths for the schemaFileURI (which is cheating; it wouldn't work in a true distributed Drill environment).

We have yet to work out how Drill can provide access to DFDL schemas in XML form so that include/import references can be resolved.

The input data stream is, however, being accessed in the proper Drill manner. Gunzip happened automatically. Nice.

Note: Fix boxed Boolean vs. boolean problem. Don't use boxed primitives in Format config objects.

Tests show Daffodil works for data as complex as having nested repeating sub-records.

These DFDL types are supported:

  • int
  • long
  • short
  • byte
  • boolean
  • double
  • float (does not work. Bug DAFFODIL-2367)
  • hexBinary
  • string

#2835

@cgivre cgivre added the enhancement (PRs that add new functionality to Drill), new-format (New Format Plugin), and doc-impacting (PRs that affect the documentation) labels on Oct 17, 2023
Contributor

@cgivre cgivre left a comment

@mbeckerle
I mistakenly pushed some code cleanup I did directly to your branch. I apologize for that. In any event, I added some comments to the BatchReader and FormatPlugin which I think will help you get unblocked.

dafParser.setInfosetOutputter(outputter);
// Lastly, we open the data stream
try {
dataInputStream = dataInputURI.toURL().openStream();
Contributor

Ok, I'm not sure why we need to do this. Drill can get you an input stream of the input file.
All you need to do is:

dataInputStream = negotiator.file().fileSystem().openPossiblyCompressedStream(negotiator.file().split().getPath());
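For illustration, here is a sketch of how that call might sit in the batch reader's setup with error handling. The surrounding method and the errorContext and logger fields are assumptions about the reader class, not code from this PR.

private InputStream openDataStream(FileSchemaNegotiator negotiator) {
  Path path = negotiator.file().split().getPath();
  try {
    // Drill hands back the stream and transparently decompresses gzipped input.
    return negotiator.file().fileSystem().openPossiblyCompressedStream(path);
  } catch (IOException e) {
    throw UserException.dataReadError(e)
        .message("Failed to open input file: %s", path)
        .addContext(errorContext)
        .build(logger);
  }
}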

Contributor Author

For the data files this works.

For schemas, this will not be a solution even temporarily. Daffodil loads schemas from the classpath. Large schemas are complex objects, akin to a software system: their dependencies are expressed via XML Schema include/import statements whose schemaLocation attributes contain relative URLs, or "absolute" URLs where absolute means relative to the root of some jar file on the classpath.

Even simple DFDL schemas are routinely spread over a couple jars.

@cgivre cgivre marked this pull request as draft October 19, 2023 02:20
contrib/pom.xml (outdated review comment, resolved)
@cgivre
Contributor

cgivre commented Oct 29, 2023

@mbeckerle Looks like you're making good progress!

@mbeckerle mbeckerle force-pushed the daffodil-2835 branch 2 times, most recently from eb418bf to c36cc07, on November 7, 2023 00:00
@mbeckerle mbeckerle changed the title from "WIP: Preliminary Review on adding Daffodil to Drill" to "DRILL-2835: Daffodil Feature for Drill" on Dec 22, 2023
@mbeckerle
Contributor Author

This is pretty much working now, in terms of constructing Drill metadata from DFDL schemas and Daffodil delivering data to Drill.

There were dozens of commits to get here, so I squashed them as they were no longer helpful.

Obviously more tests are needed, but the ones present show nested subrecords working.

Issues like how schemas get distributed and how Daffodil gets invoked in parallel by Drill are still open.

@mbeckerle mbeckerle marked this pull request as ready for review December 22, 2023 00:43
@mbeckerle
Contributor Author

Rebased onto latest Drill master as of 2023-12-21 (force pushed one more time)

Note that this is never going to pass automated tests until the Daffodil release it depends on is official. (Currently it needs a locally built Daffodil 3.7.0-SNAPSHOT, though the main Daffodil branch has the changes integrated, so any 3.7.0-SNAPSHOT build will work.)

Contributor

@cgivre cgivre left a comment

Hi Mike,
This is looking good. I have some minor comments, mostly formatting. It seems like the next step would be to figure out where and how we store the DFDL files.

extends InfosetOutputter {

private boolean isOriginalRoot() {
boolean result = currentTupleWriter() == rowSetWriter;
Contributor Author

Is the Drill coding style defined in a wiki or other doc page somewhere? I didn't find one.

If this is just standard Java style, then I need reminding, as I haven't written Java in the 12+ years prior to this effort.

@mbeckerle
Contributor Author

@cgivre yes, the next architectural-level issue is how to get a compiled DFDL schema out to every place Drill will run a Daffodil parse. Every one of those JVMs needs to reload it.

I'll do the various cleanups and such. The one issue I don't know how to fix is the "typed setter" vs. setObject issue, so if you could steer me in the right direction on that it would help.
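For reference, the distinction being asked about looks roughly like this on Drill's EVF writers. This is a minimal sketch with illustrative column names, not the plugin's actual code.

import org.apache.drill.exec.physical.resultSet.RowSetLoader;

public class TypedSetterSketch {
  public static void writeRow(RowSetLoader rowWriter, int count, String name) {
    rowWriter.start();
    // Typed setters: no boxing, no per-value type dispatch.
    rowWriter.scalar("count").setInt(count);
    rowWriter.scalar("name").setString(name);
    // The generic alternative works for any type, but boxes primitives and
    // resolves the value type on every call:
    // rowWriter.scalar("count").setObject(count);
    rowWriter.save();
  }
}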

@paul-rogers
Contributor

paul-rogers commented Jan 3, 2024 via email

@cgivre cgivre changed the title from "DRILL-2835: Daffodil Feature for Drill" to "DRILL-8474: Add Daffodil Format Plugin" on Jan 3, 2024
Date, Time, DateTime, Boolean, Unsigned integers, Integer, NonNegativeInteger, Decimal, float, double, hexBinary.
@mbeckerle
Contributor Author

mbeckerle commented Jan 4, 2024 via email

I imported the dev-support/formatter/eclipse settings and used them to reformat the code in IntelliJ IDEA.

No functional changes in this commit.
@mbeckerle
Contributor Author

This is ready for a next review. All the scalar types are now implemented with typed setter calls.

The prior review comments have all been addressed, I believe.

Remaining things to do include:

  1. How to get the compiled DFDL schema object so it can be loaded by Daffodil out at the distributed Drill nodes.
  2. Tests of nilled values (and more tests generally, to show that deeply nested and repeating nested objects work).
  3. Errors - revisit every place errors are detected or thrown to make sure they are handled the right way for both DFDL schema-compilation errors and runtime errors.

@cgivre
Contributor

cgivre commented Jan 5, 2024

@mbeckerle I had a thought about your TODO list. See inline.

This is ready for a next review. All the scalar types are now implemented with typed setter calls.

The prior review comments have all been addressed I believe.

Remaining things to do include:

  1. How to get the compiled DFDL schema object so it can be loaded by daffodil out at the distributed Drill nodes.

I was thinking about this and I remembered something that might be useful. Drill has support for User Defined Functions (UDF) which are written in Java. To add a UDF to Drill, you also have to write some Java classes in a particular way, and include the JARs. Much like the DFDL class files, the UDF JARs must be accessible to all nodes of a Drill cluster.

Additionally, Drill has the capability of adding UDFs dynamically. This feature was added here: #574. Anyway, I wonder if we could use a similar mechanism to load and store the DFDL files so that they are accessible to all Drill nodes. What do you think?

  1. Test of nilled values (and more tests generally to show deeply nested and repeating nested objects work.)
  2. Errors - revisit every place errors are detected or thrown to make sure these are being done the right way for DFDL schema compilation and runtime errors as well.

@mbeckerle
Contributor Author

@mbeckerle I had a thought about your TODO list. See inline.

This is ready for a next review. All the scalar types are now implemented with typed setter calls.
The prior review comments have all been addressed I believe.
Remaining things to do include:

  1. How to get the compiled DFDL schema object so it can be loaded by daffodil out at the distributed Drill nodes.

I was thinking about this and I remembered something that might be useful. Drill has support for User Defined Functions (UDF) which are written in Java. To add a UDF to Drill, you also have to write some Java classes in a particular way, and include the JARs. Much like the DFDL class files, the UDF JARs must be accessible to all nodes of a Drill cluster.

Additionally, Drill has the capability of adding UDFs dynamically. This feature was added here: #574. Anyway, I wonder if we could use a similar mechanism to load and store the DFDL files so that they are accessible to all Drill nodes. What do you think?

Excellent: so Drill has all the machinery; it's just a question of repackaging it so it's available for this usage pattern, which is a bit different from Drill's UDFs but also very similar.

There are two user scenarios which we can call production and test.

  1. Production: binary compiled DFDL schema file + code jars for Daffodil's own UDFs and "layers" plugins. This should, ideally, cache the compiled schema and not reload it for every query (at every node), but keep the same loaded instance in memory in a persistent JVM image on each node. For large production DFDL schemas this is the only sensible mechanism, as it can take minutes to compile large DFDL schemas.

  2. Test: on-the-fly centralized compilation of the DFDL schema (from a combination of jars and files) to create and cache (to avoid recompiling) the binary compiled DFDL schema file, then use that compiled binary file as in item 1. For small DFDL schemas this can be fast enough for production use. Ideally, if the DFDL schema is unchanged this would reuse the compiled binary file, but that's an optimization that may not matter much.

Kinds of objects involved are:

  • Daffodil plugin code jars
  • DFDL schema jars
  • DFDL schema files (just not packaged into a jar)
  • Daffodil compiled schema binary file
  • Daffodil config file - parameters, tunables, and options needed at compile time and/or runtime

Code jars: Daffodil provides two extension features for DFDL users - DFDL UDFs and DFDL 'layers' (e.g., plug-ins for uudecode, or gunzip algorithms used in part of the data format). Those are ordinary compiled class files in jars, so in all scenarios those jars are needed on the node classpath if the DFDL schema uses them. Daffodil dynamically finds and loads these from the classpath via the regular Java Service Provider Interface (SPI) mechanism.

Schema jars: Daffodil packages DFDL schema files (source files, i.e., mySchema.dfdl.xsd) into jar files to allow inter-schema dependencies to be managed using ordinary jar/Java-style managed dependencies. Tools like sbt and maven can express the dependencies of one schema on another, grab and pull them together, etc. Daffodil has a resolver, so when one schema file references another with include/import, it searches the classpath directories and jars for the files.

Schema jars are only needed centrally when compiling the schema to a binary file. All references to the jar files for inter-schema file references are compiled into the compiled binary file.

It is possible for one DFDL schema 'project' to define a DFDL schema, along with the code for a plugin like a Daffodil UDF or layer. In that case the one jar created is both a code jar and a schema jar. The schema jar aspects are used when the schema is compiled and ignored at Daffodil runtime. The code jar aspects are used at Daffodil run time and ignored at schema compilation time. So such a jar that is both code and schema jar needs to be on the class path in both places, but there's no interaction of the two things.

Binary Compiled Schema File: Centrally, DFDL schemas in files and/or jars are compiled to create a single binary object which can be reloaded in order to actually use the schema to parse/unparse data.

  • These binary files are tied to a specific version+build of Daffodil. (They are just a Java object serialization of the runtime data structures used by Daffodil.)
  • Once reloaded into a JVM to create a Daffodil DataProcessor object, that object is read-only, hence thread safe, and can be shared by parse calls happening on many threads.

Daffodil Config File: This contains settings like what warnings to suppress when compiling and/or at runtime, tunables, such as how large to allow a regex match attempt, maximum parsed data size limit, etc. This also is needed both at schema compile and at runtime, as the same file contains parameters for both DFDL schema compile time and runtime.
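To make the central compile-and-save step concrete, here is a minimal sketch using Daffodil's Java API. The file paths and class name are illustrative, and error handling is reduced to printing diagnostics.

import java.io.FileOutputStream;
import java.io.IOException;
import java.net.URI;
import java.nio.channels.Channels;
import org.apache.daffodil.japi.Compiler;
import org.apache.daffodil.japi.Daffodil;
import org.apache.daffodil.japi.DataProcessor;
import org.apache.daffodil.japi.ProcessorFactory;

public class CompileDfdlSchema {
  public static void main(String[] args) throws IOException {
    // Compile once, centrally; this is the step that can take minutes for
    // large schemas and that needs the schema jars on the classpath.
    Compiler c = Daffodil.compiler();
    ProcessorFactory pf = c.compileSource(URI.create("file:///schemas/mySchema.dfdl.xsd"));
    if (pf.isError()) {
      pf.getDiagnostics().forEach(d -> System.err.println(d.getMessage()));
      return;
    }
    DataProcessor dp = pf.onPath("/");
    // Save the binary; the nodes only ever reload this file.
    dp.save(Channels.newChannel(new FileOutputStream("/schemas/mySchema.dfdl.bin")));
  }
}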


private void loadSchema(URI schemaFileURI) throws IOException, InvalidParserException {
Compiler c = Daffodil.compiler();
dp = c.reload(Channels.newChannel(schemaFileURI.toURL().openStream()));
Contributor Author

@cgivre This reload call is the one that has to happen on every drill node.
It needs only to happen once for that schema for the life of the JVM. The "dp" object created here can be reused every time that schema is needed to parse more data. The dp (DataProcessor) is a read only (thread safe) data structure.

As you see, this can throw exceptions, so the question of how those situations should be handled arises.
Even if Drill perfectly makes the file available to every node (which would rule out the IOException due to a missing file or access rights), a user can still create the compiled DFDL schema binary file using the wrong version of the Daffodil schema compiler, a mismatch with the runtime; hence it is still possible for the InvalidParserException to be thrown.
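A minimal sketch of how that once-per-JVM reload could be cached so the DataProcessor is shared across queries and threads; the class and cache structure are illustrative, not what this PR implements.

import java.io.IOException;
import java.net.URI;
import java.nio.channels.Channels;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import org.apache.daffodil.japi.Daffodil;
import org.apache.daffodil.japi.DataProcessor;
import org.apache.daffodil.japi.InvalidParserException;

public class DataProcessorCache {
  // One reload per schema per JVM; the DataProcessor is read-only and
  // thread safe, so every reader in the Drillbit can share it.
  private static final Map<URI, DataProcessor> CACHE = new ConcurrentHashMap<>();

  public static DataProcessor get(URI schemaFileURI) {
    return CACHE.computeIfAbsent(schemaFileURI, uri -> {
      try {
        return Daffodil.compiler().reload(Channels.newChannel(uri.toURL().openStream()));
      } catch (IOException | InvalidParserException e) {
        // computeIfAbsent cannot throw checked exceptions, so rethrow unchecked.
        throw new IllegalStateException("Failed to reload compiled DFDL schema: " + uri, e);
      }
    });
  }
}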

Contributor

This definitely seems like an area where there is potential for a lot of different things to go wrong. My view is we should just do our best to provide clear error messages so that the user can identify and fix the issues.

try {
dmp.loadSchema(schemaFileURI);
} catch (IOException | InvalidParserException e) {
throw new CompileFailure(e);
Contributor Author

Error architecture?

This loadSchema call needs to happen on every node, and so has the potential (if the loaded binary schema file is no good or mismatches the Daffodil library version) to fail. Is throwing this exception the right thing here or are other steps preferred/necessary?

Contributor

My thought here would be to fail as quickly as possible. If the DFDL schema can't be read, I'm assuming that we cannot proceed, so throwing an exception would be the right thing to do IMHO. With that said, we should make sure we provide a good error message that would explain what went wrong.
One of the issues we worked on for a while with Drill was that it would fail and you'd get a stack trace without a clear idea of what the actual issue was or how to rectify it.
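As an illustration of that suggestion, the catch shown above could surface the failure through Drill's UserException with actionable context. This is a sketch only, reusing the dmp, schemaFileURI, errorContext, and logger visible in the surrounding reader, with illustrative message text.

try {
  dmp.loadSchema(schemaFileURI);
} catch (IOException | InvalidParserException e) {
  throw UserException.dataReadError(e)
      .message("Failed to load compiled DFDL schema: %s", schemaFileURI)
      .addContext("The binary schema must be compiled with the same Daffodil version used at runtime")
      .addContext(errorContext)
      .build(logger);
}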

.addContext(errorContext).build(logger);
}
if (dafParser.isValidationError()) {
logger.warn(dafParser.getDiagnosticsAsString());
Contributor Author

Do we need an option here to convert validation errors to fatal?

Will logger.warn be seen by a query user, or is that just for someone dealing with the logs?

Validation errors either should be escalated to fatal, OR they should be visible in the query output display to a user somehow.

Either way, users will need a mechanism to suppress validation errors that prove to be unavoidable, since they could be commonplace. Nobody wants thousands of warnings about something they can't avoid and that doesn't stop parsing and querying the data.

Contributor

@mbeckerle The question I'd have is whether the query can proceed if validation fails. (I don't know the answer)
If the answer is no, then we need to halt execution ASAP and throw an exception. If the answer is it can proceed, but the data might be less than ideal, maybe we add a configuration option which will allow the user to decide the behavior on a validation failure.

I could imagine situations where you have Drill unable to read a huge file because someone fat fingered a quotation mark somewhere or something like that. In a situation like that, sometimes you might just want to say I'll accept a row or two of bad data just so I can read the whole file.
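A sketch of what such a switch could look like as a Jackson-mapped field on the format plugin config, using a primitive boolean per the note in the PR description. The class name and property name are illustrative, not the plugin's actual config.

import com.fasterxml.jackson.annotation.JsonCreator;
import com.fasterxml.jackson.annotation.JsonProperty;
import com.fasterxml.jackson.annotation.JsonTypeName;
import org.apache.drill.common.logical.FormatPluginConfig;

@JsonTypeName("daffodil")
public class DaffodilFormatConfigSketch implements FormatPluginConfig {
  // Primitive boolean, not boxed Boolean, so the default is a plain false.
  private final boolean failOnValidationError;

  @JsonCreator
  public DaffodilFormatConfigSketch(
      @JsonProperty("failOnValidationError") boolean failOnValidationError) {
    this.failOnValidationError = failOnValidationError;
  }

  public boolean getFailOnValidationError() {
    return failOnValidationError;
  }
}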

Contributor Author

Agree.

We draw a distinction between "well formed" and "invalid" data and whether one does validation seems like the right switch in daffodil to use.

If data is malformed, that means you can't successfully parse it. If it is invalid, that just means values are unexpected. Example: A 3 digit number representing a percentage 0 to 100. -1 is invalid, ABC is malformed.

If data is not well formed, you really cannot continue parsing it, as you cannot convert it to the type expected. But, if you are able to determine at least how big it is, it's possible to capture that length of data into a dummy "badData" element which is always invalid (so isn't a "false positive" parse). This capability has to be designed into the DFDL schema, but it is something we've been doing more and more.

Hence, one can tolerate even some malformed data. If it is malformed to where you cannot determine the length, then continuing is impossible.

We will see if more than this is needed. Options like "treat everything as strings/varchar" or "all numbers are float", which you have for tolerating such situations with other data connectors, may prove useful, particularly while a DFDL schema is in development and you are really just testing it (and the corresponding data) using Drill.

@cgivre
Contributor

cgivre commented Jan 7, 2024

@mbeckerle
With respect to style, I tried to reply to that comment, but the thread won't let me. In any event, Drill classes will typically start with the constructor, then have whatever methods are appropriate for the class. The logger creation usually happens before the constructor. I think all of your other classes followed this format, so the one or two that didn't kind of jumped out at me.

@cgivre
Contributor

cgivre commented Jan 7, 2024


@mbeckerle Would you want to chat sometime next week and I can walk you through the UDF architecture? I don't know how relevant it would be, but you'd at least see how things are installed and so forth.

@mbeckerle
Contributor Author

@mbeckerle With respect to style, I tried to reply to that comment, but the thread won't let me. In any event, Drill classes will typically start with the constructor, then have whatever methods are appropriate for the class. The logger creation usually happens before the constructor. I think all of your other classes followed this format, so the one or two that didn't kind of jumped out at me.

@cgivre I believe the style issues are all fixed. The build did not flag any codestyle issues.

@cgivre
Contributor

cgivre commented Jan 14, 2024

@mbeckerle With respect to style, I tried to reply to that comment, but the thread won't let me. In any event, Drill classes will typically start with the constructor, then have whatever methods are appropriate for the class. The logger creation usually happens before the constructor. I think all of your other classes followed this format, so the one or two that didn't kind of jumped out at me.

@cgivre I believe the style issues are all fixed. The build did not get any codestyle issues.

The issue I was referring to was more around the organization of a few classes. Usually we'll have the constructor (if present) at the top followed by any class methods. I think there was a class or two where the constructor was at the bottom or something like that. In any event, consider the issue resolved.

This significantly simplifies the metadata walking that converts Daffodil metadata to Drill metadata.
@mbeckerle
Contributor Author

@cgivre @paul-rogers is there an example of a Drill UDF that is not part of the drill repository tree?

I'd like to understand the mechanisms for distributing any jar files and dependencies of a UDF that Drill uses. I can't find any such example in the quasi-UDFs that are in the Drill tree because, since they are part of Drill (and so are their dependencies), this problem doesn't exist.

@cgivre
Contributor

cgivre commented Jan 21, 2024

@cgivre @paul-rogers is there an example of a Drill UDF that is not part of the drill repository tree?

I'd like to understand the mechanisms for distributing any jar files and dependencies of a UDF that Drill uses. I can't find any such example in the quasi-UDFs that are in the Drill tree because, since they are part of Drill (and so are their dependencies), this problem doesn't exist.

@mbeckerle Here's an example: https://github.com/datadistillr/drill-humanname-functions. I'm sorry we weren't able to connect last week.

@mbeckerle
Contributor Author

@cgivre @paul-rogers is there an example of a Drill UDF that is not part of the drill repository tree?
I'd like to understand the mechanisms for distributing any jar files and dependencies of a UDF that Drill uses. I can't find any such example in the quasi-UDFs that are in the Drill tree because, since they are part of Drill (and so are their dependencies), this problem doesn't exist.

@mbeckerle Here's an example: https://github.com/datadistillr/drill-humanname-functions. I'm sorry we weren't able to connect last week.

If I understand this correctly, if a jar is on the classpath and has drill-module.conf in its root dir, then drill will find it and read that HOCON file to get the package to add to drill.classpath.scanning.packages.

Drill then appears to scan jars for class files for those packages. Not sure what it is doing with the class files. I imagine it is repackaging them somehow so Drill can use them on the drill distributed nodes. But it isn't yet clear to me how this aspect works. Do these classes just get loaded on the distributed drill nodes? Or is the classpath augmented in some way on the drill nodes so that they see a jar that contains all these classes?

I have two questions:

(1) what about dependencies? The UDF may depend on libraries which depend on other libraries, etc.

(2) What about non-class files, e.g., things under src/main/resources of the project that go into the jar but aren't class files? How do those get moved as well? How would code running on a Drill node access them? The usual method is to call getResource/getResourceAsStream with a path that resolves to the resource inside a jar file.

Thanks for any info.
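For question (2), classpath resource access itself is just the standard Java mechanism; a small sketch with an illustrative resource path (the open question is how Drill gets the resource onto each node's classpath, not how to read it).

import java.io.IOException;
import java.io.InputStream;

public class ResourceAccessSketch {
  public static InputStream openSchemaResource() throws IOException {
    // Resolves the resource wherever it lives on the classpath, including inside a jar.
    InputStream in = ResourceAccessSketch.class
        .getResourceAsStream("/com/example/schemas/mySchema.dfdl.xsd");
    if (in == null) {
      throw new IOException("Resource not found on classpath");
    }
    return in;
  }
}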

@cgivre
Contributor

cgivre commented Jan 23, 2024

@cgivre @paul-rogers is there an example of a Drill UDF that is not part of the drill repository tree?
I'd like to understand the mechanisms for distributing any jar files and dependencies of a UDF that Drill uses. I can't find any such example in the quasi-UDFs that are in the Drill tree because, since they are part of Drill (and so are their dependencies), this problem doesn't exist.

@mbeckerle Here's an example: https://github.com/datadistillr/drill-humanname-functions. I'm sorry we weren't able to connect last week.

If I understand this correctly, if a jar is on the classpath and has drill-module.conf in its root dir, then drill will find it and read that HOCON file to get the package to add to drill.classpath.scanning.packages.

I believe that is correct.
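For reference, the drill-module.conf addition is typically a single HOCON line at the root of the jar; the package name here is illustrative.

drill.classpath.scanning.packages += "com.example.drill.udfs"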

Drill then appears to scan jars for class files for those packages. Not sure what it is doing with the class files. I imagine it is repackaging them somehow so Drill can use them on the drill distributed nodes. But it isn't yet clear to me how this aspect works. Do these classes just get loaded on the distributed drill nodes? Or is the classpath augmented in some way on the drill nodes so that they see a jar that contains all these classes?

I have two questions:

(1) what about dependencies? The UDF may depend on libraries which depend on other libraries, etc.

So UDFs are a bit of a special case, but if they do have dependencies, you have to also include those JAR files in the UDF directory, or in Drill's 3rd party JAR folder. I'm not that good with maven, but I've often wondered about making a so-called fat-JAR which includes the dependencies as part of the UDF JAR file.

(2) What about non-class files, e.g., things under src/main/resources of the project that go into the jar but aren't class files? How do those get moved as well? How would code running on a Drill node access them? The usual method is to call getResource/getResourceAsStream with a path that resolves to the resource inside a jar file.

Take a look at this UDF. https://github.com/datadistillr/drill-geoip-functions
This UDF has a few external resources including a CSV file and the MaxMind databases.

Thanks for any info.

@mbeckerle
Contributor Author

mbeckerle commented Jan 23, 2024

Ok, so the geo-ip UDF stuff has no special mechanisms or description about those resource files, so the generic code that "scans" must find them and drag them along automatically.

That's the behavior I want.

@cgivre What is "Drill's 3rd Party Jar folder"?

If a magic folder just gets dragged over to all nodes, and Drill uses a class loader that arranges for jars in that folder to be searched, then there is very little to do: a DFDL schema can be just a set of jar files containing related resources, plus the classes for Daffodil's own UDFs and layers, which are Java code extensions of Daffodil's own kind.

@mbeckerle
Contributor Author

This now passes all the Daffodil contrib tests using the published official Daffodil 3.7.0.

It does not yet run in any scalable fashion, but the metadata/data interfacing is complete.

I would like to squash this to a single commit before merging, and it needs to be tested rebased onto the latest Drill commit.

@mbeckerle
Contributor Author

Creating a new squashed PR so as to avoid loss of the comments on this PR.
